pkg: add reusable opencost + prom modules for FinOps v2#492
Conversation
e43924c to
4ded49b
Compare
4ded49b to
226ff2b
Compare
Moves the Prometheus client + cost math out of internal/ and into pkg/ so koala-backend and skyhook-connector can share the code for the FinOps v2 Costs Explorer. pkg/prom — extracted from internal/prometheus: - Client, HTTPTransport, Discover(), and the metric-query builders. - Optional Headers map on HTTPTransport for auth-protected backends (Authorization, X-Scope-OrgID, ...) — applied after Accept. - Memory-scrape dedupe (max-by container inside the outer sum) folded into BuildQuery / BuildNamespaceQuery / BuildClusterQuery so callers do not double-count series from multiple scrape jobs. - internal/prometheus is now a thin facade re-exporting via aliases.go; client.go and discovery.go delegate, queries.go moved under pkg/. pkg/opencost has two paths into the same data: - REST (ComputeCostSummary, ComputeCostTrend) hits OpenCost's /allocation API. windowHours() normalizes /allocation.totalCost (summed over the window) to dollars/hr so callers project monthly via x 730 regardless of window. Coarse default steps (24h->6h, 7d->1d, 30d->2d) fit CAC's 30s proxy budget. Includes pod/controller aggregation, client-side namespace post-filter for OpenCost versions that silently ignore the REST filter param, removal of synthetic __unallocated__ drilldown rows, and per-row NetworkCost surfacing. - PromQL (ComputeCostSummaryFromProm, ComputeCostTrendFromProm, ComputeWorkloadsFromProm, ComputeNodeCosts) extracted from internal/opencost/handlers.go. The workloads path takes a PodOwnerLookup callback so pkg/opencost stays free of k8s.io/client-go; radar supplies it from its informer cache, koala-backend / connector can supply it from CAC-served pod metadata. internal/opencost/handlers.go shrinks 688 -> 147 lines; the four endpoints (/summary, /workloads, /trend, /nodes) keep their JSON shape unchanged.
The endpoint accepted ?url=<any HTTP(S) URL> and called SetURL() to redirect all subsequent Prometheus queries to that host. The scheme was validated but the host was not, and radar binds to 0.0.0.0 by default — so anyone reachable on the listen port could exfiltrate the configured --prometheus-header credentials to an arbitrary host or port-scan internal services via Prometheus query latency. The override was undocumented and unused by the radar UI (it only POSTs /prometheus/connect with no body or params). Operators set the Prometheus URL via --prometheus-url at startup; there's no UX cost to removing the per-request override. Also removes the now-orphaned (*Client).SetURL method. Flagged by CodeQL go/request-forgery (alert #330).
- CLAUDE.md: drop the "Active feature branches" section. CLAUDE.md is shared, checked-in, and loaded into every Claude session — temporal per-branch state has the wrong lifetime here. - pkg/prom, pkg/opencost: rewrite docstrings that named internal projects and proxy components. pkg/* is open source; comments now describe the abstraction generically (direct HTTP / port-forward / tunneled proxy) without naming specific downstream consumers. - internal/opencost, internal/prometheus: same scrub on the two comments that referenced those names. - pkg/opencost/compute.go: drop the "legacy" qualifier on ComputeCostSummaryFromProm — it's the path radar's runtime actually uses, not deprecated.
Pre-merge audit of pkg/prom and pkg/opencost for over-built abstractions and unreachable code. Net -100 LOC, no behavior change. Removed: - Client.Transport() / RESTClient.Transport() / RESTClient.Address() — unused getters with no callers in the repo. - json.Number branch in parseDataPoint — unreachable; default json.Unmarshal yields float64, not json.Number. - stepHours() in trend.go — windowHours() already handles "m"/"h"/"d"/"w" correctly, so the wrapper's fallback switch was dead code. - TrendPromOptions.Start/End/Step + resolveTrendRange's override branches — no caller set them; the function only ever receives a Range string. - Lowercase categoryUsesContainerFilter alias in internal/prometheus — one call site, renamed to use the canonical CategoryUsesContainerFilter. Tightened: - Unexported metricsNamespaces / skipNamespaces — internal scoring data, no external callers. - Consolidated lastValuePerNamespace into lastValuePerLabel (general helper already used by nodes.go). Comments scrubbed: - "Exposed so callers..." speculation on WellKnownLocations / ScoreService. - "Retained as a method on *Client so existing discovery call sites..." diff-history phrasing on probe. - "pre-fix behavior" / "mirrors radar's prior behavior" / "default compute path for the finops backend" — diff-history references and one leftover internal-project name in rest_client.go.
internal/prometheus/aliases.go and internal/opencost/types.go were
intra-package re-exports of pkg/prom and pkg/opencost symbols. They had
zero external callers; their only "users" were the same-package files
saving 5 characters by writing BuildQuery instead of prom.BuildQuery.
Removed both. Callers updated to use the qualified form (prom.BuildQuery,
pkgopencost.CostSummary). The qualified form is more informative — readers
see which package owns each symbol — and the deletion removes 75 lines of
indirection that no API consumer needed.
Affected: internal/prometheus/{client,discovery,handlers,queries_test,
handlers_test}.go, internal/opencost/handlers.go, internal/traffic/
{istio,caretta}.go. No functional change; full test suite green.
max-w-lg (~512px) was too narrow for the content density — multi-paragraph sections wrapped awkwardly with tons of empty horizontal space available on the page. Bumped to max-w-2xl (~672px) for a more comfortable reading width.
The extraction collapsed several error paths into typed Reason codes without logging the underlying cause. That preserved the in-band UI contract but left operators flying blind when something actually broke. This restores the pre-extraction logging behavior and adds the cases the extraction made possible. pkg/prom: - Probe now distinguishes ProbeReasonAuthError (HTTP 401/403) from generic transport errors so callers can flag credential failures specifically. - HTTPError doc updated to describe the actual contract: callers extract status via errors.As (matches the new Probe behavior). internal/prometheus: - probe() switches on every non-OK ProbeReason: auth failures and empty instances go to errorlog (operator-visible); not-prometheus, prom-error and transport errors go to stdlib log (audit trail). Replaces the silent fall-through that the old per-candidate logging used to cover. - EnsureConnected logs the reason when a cached connection fails its probe and we fall back to rediscovery. pkg/opencost: - ComputeCostSummaryFromProm gets the nil-client guard the three sibling ComputeXFromProm functions already had — closes a real context-switch race where client.Prom() can return nil between EnsureConnected and use. - ComputeWorkloadsFromProm gets the ReasonNoMetrics guard the summary path already had; empty results no longer masquerade as Available=true with an empty list. - Every Query / QueryRange / GetAllocation error that collapses to ReasonQueryError is now logged with the underlying message at the call site (operators were getting "query_error" with no context before). - Best-effort usage queries log when they fail (efficiency / idle would otherwise silently render as 0 — indistinguishable from a low-util workload). - ComputeCostTrend skips buckets with no parseable Start timestamp instead of stamping their data points at the Unix epoch.
Caught by a second-pass comment audit. Same OSS-hygiene bar as the earlier scrub round. - pkg/opencost/rest_client.go: drop "the finops UI actually consumes" product framing in the Allocation type doc. - pkg/opencost/types.go: drop "Despite the historical name" diff-history phrasing on NamespaceCost — the name reflects the default aggregation, the struct doc explains the controller/pod reuse. - pkg/prom/discovery.go: drop "lightweight connector probes" — the IncludeDynamic option's purpose stands without naming a caller. Also fixes pkg/opencost/compute.go where the idle-row comment referenced "the Nodes tab hits /assets" — UI-specific reference plus factually wrong (node costs use Prometheus, not /assets) — and pkg/opencost/trend.go where defaultStep's rationale named a specific RPC-pipeline deadline with the wrong number.
Three behavior gaps the right-shape review surfaced. All three were in
code paths the extraction is specifically meant to make reusable — pin
the contracts so future "simplifications" can't quietly regress them.
- TestComputeWorkloads_OwnerLookup{Resolves,NilFallsBack,Unresolved} +
TestComputeWorkloads_EmptyResultReturnsNoMetricsReason +
TestComputeWorkloads_NilClient + TestStripPodSuffix:
Pin the PodOwnerLookup callback contract end-to-end (resolved → real
workload kind; nil → standalone via name-strip; per-pod false → fall
back per-pod; empty results → NoMetrics).
- TestComputeCostTrendFromProm_{TopNAndOther,AllUnderMaxSeriesNoOther,
EmptyNamespaceLabelSkipped,NilClient,NoSeries}:
Pin the top-N + "other" aggregation — input series ranked by latest
value, top N returned individually, remainder summed per-timestamp
into a single "other" series. Catches refactor regressions in the
cross-loop topSet state.
- TestWindowHours: 13-case table covering every branch of windowHours
including the documented "m" = minutes (not months) decision so the
ambiguity comment can't be silently "fixed" in the wrong direction.
Drive-by:
- Drop pkg/prom/client_test.go's hand-rolled errorsAs helper, use the
stdlib errors.As.
- Drop pkg/opencost/compute_test.go's strFromFloat ceremony (json.Marshal
to stringify a number), use strconv.FormatFloat.
URL path/query params (kind, namespace, name, category) flow into log lines via %s. With radar bound to 0.0.0.0 by default and no K8s-name validation enforced by the HTTP layer, an attacker reaching the port could embed \n + a forged "[prometheus] ..." prefix in a URL path to inject fake log entries (and/or terminal escapes if logs are tailed). Swapped %s → %q at every log/errorlog/Sprintf site that formats a user-controlled string. %q quotes the value and escapes control characters; the attacker's payload now logs as a single, unambiguous quoted literal rather than forging a new line. Covers internal/prometheus/handlers.go (the 3 CodeQL log-injection alerts: handlers.go:202, :347, :442 — pre-existing on main since April but visible on every PR that touches the file) and the namespace-tainted log lines added earlier in this PR in pkg/opencost/workloads.go.
All three are extraction-introduced regressions or subtle races the cursor[bot] reviewer caught after the last round. pkg/opencost/trend.go: REST trend bucketTimestamp returned t.UnixMilli() while the PromQL trend path (and pkg/prom.parseDataPoint generally) emits Unix seconds. Both feed the same CostDataPoint.Timestamp int64 field; the frontend (CostTrendChart.tsx:317) multiplies by 1000 assuming seconds. The REST path's ms values would render at year ~57658. Pre-extraction convention was seconds. Switched to t.Unix(). internal/prometheus/client.go: EnsureConnected probed only when base!="" AND cached!=nil. After markConnected returns (sets baseURL but not c.prom), the very next request races getPromClient and sees base!="" + cached==nil → falls through to a full rediscover() that wasn't needed. Same trigger after SetHeaders. Pre-extraction probed whenever base!=""; restored that behavior by building the prom client on-demand via getPromClient inside EnsureConnected. internal/prometheus/discovery.go: markConnected updated baseURL + basePath but didn't clear c.prom. If c.prom was somehow cached against a different address (race window with concurrent getPromClient + discovery landing on a different endpoint), the stale client survives. Cleared c.prom inside the same mutex block.
Two more cursor bugbot finds, both real. internal/prometheus/client.go: Query and QueryRange chained c.getPromClient().Query(...) with no nil check. If concurrent Reset() (typically on context switch) clears c.baseURL between EnsureConnected returning and getPromClient executing, getPromClient returns nil and the chain panics. Now check for nil and return a typed error. pkg/opencost/compute.go: REST ComputeCostSummary computed ClusterEfficiency as totalEff/effRows (unweighted mean of per-row efficiencies). ComputeCostSummaryFromProm computes it as totalUsageCost/totalAllocCost (cost-weighted). Same response type (CostSummary), so switching data sources changed the reported number for the same underlying cluster — a $0.01/hr namespace at 10% efficiency had the same weight as a $100/hr namespace. Now also accumulates totalAllocCost + totalUsageCost in the REST loop and uses the cost-weighted formula. The capping-then-accumulating logic is preserved (still important to clamp burstable rowEff > 1.0 before adding to the totals). Also fixed a flake in TestComputeWorkloads_OwnerLookupResolves: the test had api and worker with equal aggregate cost, so the sort order was decided by Go's map iteration. Bumped worker's value to break the tie deterministically.
After fixing the REST efficiency-formula mismatch, the same "clamp-to-100 efficiency %" and "max(alloc-usage, 0) idle" math appeared in 5+ places across compute.go and workloads.go. Extracted two small helpers: - efficiencyPct(usage, alloc) → 0..100, rounded to 1dp, with the "no data ≠ 0% efficient" gate (returns 0 only when either is non-positive, not when one happens to be small). - idleFromUsage(usage, alloc) → max(alloc - usage, 0) with the same gate. "No usage data" doesn't mean "all idle"; it means we don't know, so we report 0 idle. Call sites consolidated: - ComputeCostSummary REST per-row (was inline rowEff*100 + idle clamp) - ComputeCostSummary REST cluster (was inline cap-to-100) - ComputeCostSummaryFromProm per-row (was 4-line if-block) - ComputeCostSummaryFromProm cluster (was 8-line if-block setting both efficiency and idle) - ComputeWorkloadsFromProm per-workload (was 4-line if-block) Net -7 LOC, all tests still pass.
Pre-existing behavior only attempted port-forwarding the highest-priority candidate (`candidates[0]`). If that one's pods were missing or unreachable, discovery gave up — even when lower-priority candidates would have worked. This is exactly the scenario that bit us on the FinOps test cluster (kube-prometheus-stack-prometheus service with no backing pods masked the working prometheus-server in the opencost namespace). Now the fallback loop walks the candidate list in priority order, stopping any broken port-forward before trying the next, and surfaces the most relevant failure (lastErr) when none work. Also wires pkg/prom.Discover's internal Logger callback to the prometheus log prefix so candidate-enumeration progress is visible.
main added internal/prometheus/rightsizing.go (resource-rightsizing recommendations endpoint) while this PR was in flight. The new file used unqualified SanitizeLabelValue / escapeRegexMeta / QueryResult identifiers that were valid pre-extraction but no longer exist in internal/prometheus after the aliases.go shim was removed. - pkg/prom/queries.go: export escapeRegexMeta as EscapeRegexMeta so internal/prometheus can use it (single character of API surface — the function is a 2-line regex sanitizer with no good reason to hide it from sibling packages). - internal/prometheus/rightsizing.go: qualify the three references to use prom.SanitizeLabelValue / prom.EscapeRegexMeta / prom.QueryResult. Build + tests green.
46c0ce9 to
85012c6
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 85012c6. Configure here.
| // already connected. Returns the base URL and base path, or an error. | ||
| func (c *Client) EnsureConnected(ctx context.Context) (string, string, error) { | ||
| c.mu.RLock() | ||
| if c.baseURL != "" { |
There was a problem hiding this comment.
EnsureConnected returns stale base URL after concurrent rediscovery
Low Severity
EnsureConnected reads base and bp under RLock at lines 174-177, then calls getPromClient() which acquires its own separate RLock and reads baseURL/basePath again. If a concurrent markConnected or Reset changes the address between the two reads, getPromClient builds and probes a client against a different address than what base/bp hold — but EnsureConnected returns the original stale base/bp values to the caller. The internal Query/QueryRange methods discard the return values, and the handler layer doesn't use them either, limiting the practical impact — but the contract ("returns the base URL") is violated, and any future caller relying on the returned values would get inconsistent data.
Reviewed by Cursor Bugbot for commit 85012c6. Configure here.
| } else { | ||
| cp := *a | ||
| combined[name] = &cp | ||
| } |
There was a problem hiding this comment.
REST efficiency silently uses only first window's value when merging
Medium Severity
When ComputeCostSummary merges allocations across multiple time-window buckets (the resp.Data array), it accumulates cost fields but does not properly handle TotalEfficiency. The first bucket's allocation is copied via cp := *a, preserving its TotalEfficiency. Subsequent buckets add their costs into the existing record but never update TotalEfficiency — so the final efficiency is based only on the first bucket's value, not a cost-weighted average across all buckets. For single-bucket responses (the common case) this is fine, but multi-window responses will report incorrect per-row and cluster-level efficiency.
Reviewed by Cursor Bugbot for commit 85012c6. Configure here.


Summary
Extracts radar's cost math and Prometheus client out of
internal/into reusablepkg/opencostandpkg/promso other consumers of OpenCost can share the same battle-tested logic. Also adds a parallel REST-based cost path against OpenCost's/allocationAPI for callers that have OpenCost reachable but not scraped into Prometheus.Headline numbers:
internal/opencost/handlers.go: 688 → 147 lines (thin HTTP wrappers aroundpkg/opencost)internal/prometheus/{client,discovery,queries}.go: ~1.6k lines out, replaced by delegating wrappers + aqueries.gorename underpkg/prom/+/--balanced for the extraction itself, with additions concentrated inpkg/opencost's new REST path + ~1k lines of new test coverage./api/opencost/{summary,workloads,trend,nodes},/api/prometheus/{status,connect,...}) keep their JSON shape unchanged.pkg/prom— extracted frominternal/prometheusClient,HTTPTransport,Discover(), metric-query builders.Headers map[string]stringonHTTPTransportfor auth-protected backends (Authorization,X-Scope-OrgID, ...).Probenow distinguishesProbeReasonAuthError(401/403) from generic transport errors so callers can flag credential failures specifically.pkg/opencost— two paths into the same dataREST (
ComputeCostSummary,ComputeCostTrend) hits OpenCost's/allocationAPI:windowHours()normalizes/allocation.totalCost(summed over the window) to $/hr so callers project monthly via × 730 regardless of window. Regression test pins this.SummaryOptions.Aggregate, client-side namespace post-filter,__unallocated__row suppression,NetworkCostsurfacing.PromQL (
ComputeCostSummaryFromProm,ComputeCostTrendFromProm,ComputeWorkloadsFromProm,ComputeNodeCosts) — extracted frominternal/opencost/handlers.go:PodOwnerLookupcallback sopkg/opencoststays free ofk8s.io/client-go. radar supplies it from its informer cache; other consumers supply it from whatever pod-metadata source they have.Security
?url=override on/api/prometheus/connect. The endpoint binds to 0.0.0.0 by default and the override let any reachable caller redirect Prometheus queries (including configured--prometheus-headerauth tokens) to an arbitrary host. CodeQLgo/request-forgeryalert dismissed.%sto%qat every log/errorlog/Sprintf site in the prometheus handlers and the opencost workloads namespace logging. Prevents log forging via crafted URL path params.Error visibility
The extraction initially collapsed several error paths into typed
Reasoncodes without logging the underlying cause. Restored to match pre-extraction behavior: auth failures and empty instances go to errorlog, every Query/QueryRange/GetAllocation error logs its underlying message before collapsing toReasonQueryError, and discovery's per-candidate rejection reasons are visible again.Robustness fixes during review
pkg/opencost/trend.go: REST trend was emitting Unix-ms timestamps while PromQL trend emits Unix-seconds; both feed the sameCostDataPoint.Timestampfield. Switched REST tot.Unix().internal/prometheus/client.go:EnsureConnectedprobed only whencached != nil; replaced with on-demandgetPromClientinside the probe path. Also guardsQuery/QueryRangeagainst a nil prom client from a concurrentReset()(returns a typed error instead of panicking).internal/prometheus/discovery.go:markConnectedclearsc.promso a stale cached client can't survive a discovery that landed on a different address. Port-forward fallback now tries every candidate in priority order (was only tryingcandidates[0]), and surfaces the last failure when none succeed.pkg/opencost/compute.go: RESTClusterEfficiencywas an unweighted mean of per-row efficiencies; PromQL was a cost-weighted ratio. Same response type — switching data sources changed the reported number. REST now uses the same cost-weighted formula. Both paths shareefficiencyPct/idleFromUsagehelpers so the math stays in one place.Tests
~1k lines of new test coverage in
pkg/promandpkg/opencost— including the window-normalization regression, header pass-through, memory-scrape dedupe, thePodOwnerLookupcallback contract, top-N + "other" trend aggregation, and awindowHoursunit table.Test plan
go test ./...green in both root andpkg/submodules.Note
Medium Risk
Moderate risk due to large refactor/extraction of Prometheus discovery/client and OpenCost cost computations into new packages, which could subtly change query behavior or availability handling despite aiming to preserve API shapes. Also changes connection/probing and discovery/port-forward fallbacks, impacting how/when Prometheus is reached in different environments.
Overview
Refactors the OpenCost and Prometheus integrations by extracting the core query/compute logic out of
internal/into new reusable packagespkg/opencostandpkg/prom, and updates the existing HTTP handlers to become thin wrappers delegating into these packages.Adds a new REST-based OpenCost cost path (via
/allocation) alongside the existing PromQL-based path, including window normalization to hourly rates, idle/efficiency handling, and trend bucketing/top-N aggregation, with substantial new unit test coverage.Reworks Prometheus client/discovery to build on
pkg/prom(new transport/client/probe + shared candidate discovery/scoring), improves probe diagnostics and header handling, and removes the/prometheus/connect?url=per-request override to avoid SSRF. Logging/typing is adjusted across handlers (e.g.,%qformatting) and query helpers are redirected topkg/prombuilders/sanitizers.Reviewed by Cursor Bugbot for commit 85012c6. Bugbot is set up for automated code reviews on this repo. Configure here.